[SPARK-8288][SQL] ScalaReflection can use companion object constructor #23062
Conversation
Test build #98940 has finished for PR 23062 at commit
srowen left a comment:

Probably OK. My concern is whether this still works in Scala 2.11 and 2.12, and there are some style issues.
Review threads on:

- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
- ...talyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ObjectExpressionsSuite.scala
drewrobb: Thanks much for the prompt feedback. PTAL at the changes. I had to modify the ScroogeLikeExample class to make the additional DatasetSuite test work -- my prior example had removed methods from the scrooge code that were actually necessary. I've kept these as separate commits for review, but can squash later. Re: Scala 2.12, are there any specific things you're worried about that I could look at? I'm not super familiar with the changes in 2.12, and I think 2.11 should be fine, as this was developed using 2.11. And I assume the QA build is testing on 2.12 also? Also, this might not be the place to ask, but it is rather ungreppable: have you encountered compilation-generated java files in ./org/ (that directory is also not in .gitignore!) breaking compilation in sbt? In my development environment I frequently have to delete that entire directory for compilation to work again.
srowen left a comment:

I don't have a specific concern about Scala 2.11 or 2.12; I just know that this is the kind of thing that can break across versions. The PR builder will use 2.12, and we do have other tests for 2.11.
Further review threads on:

- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
Test build #99033 has finished for PR 23062 at commit
More review threads on sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala.
The thread below concerns this excerpt from the ScroogeLikeExample test class:

```scala
override def canEqual(other: Any): Boolean = other.isInstanceOf[ScroogeLikeExample]

private def _equals(x: ScroogeLikeExample, y: ScroogeLikeExample): Boolean =
  x.productArity == y.productArity &&
```
srowen: Sure, but you've already established that it's a ScroogeLikeExample here. Why must it be Product1 just to check whether it's also Product1? It seems like it's not adding anything. In fact, why not just compare the one element that this trait knows about? To the extent it can implement equals() meaningfully, that's all it is doing already.
drewrobb: My previous answer was not complete. Product1 is also necessary so that the implicit Encoders.product[T <: Product : TypeTag] will work with this class. If it is omitted, the DatasetSuite test will not compile:

```
[error] /home/drew/spark/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala:1577: value toDS is not a member of Seq[org.apache.spark.sql.catalyst.ScroogeLikeExample]
[error] val ds = data.toDS
```

I could add some new encoder, but I think that might be worse, as the goal of this PR is for Scrooge classes to work with the provided implicit encoders.
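For context, a small sketch of why the Product bound matters here (Encoders.product is real Spark API; the helper name below is only illustrative):

```scala
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.ScroogeLikeExample

// Mirrors the bound on SQLImplicits' implicit product encoder: an Encoder is
// only derivable this way for T <: Product, which is why the trait must
// extend Product1.
def productEncoder[T <: Product : TypeTag]: Encoder[T] = Encoders.product[T]

// Compiles because ScroogeLikeExample extends Product1[Int]; without that
// supertype, implicit resolution fails exactly as in the toDS error above.
val enc: Encoder[ScroogeLikeExample] = productEncoder[ScroogeLikeExample]
```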
srowen: Just use SparkSession.createDataset?
srowen: Hm, actually that probably won't work any more or less. OK, it's because there is an Encoder for Product. You can still simplify the equals() and so on, I think, but that looks easier than a new Encoder. Or is it sufficient to test a Seq of a concrete subtype of ScroogeLikeExample?
drewrobb: I'm worried about changing the tests to use a concrete subtype, because the reflection calls might behave differently in that case, either now or later on. I simplified a little more. canEqual is necessary to implement Product; equals is necessary or the tests will not pass (they would check object pointer equality); and hashCode is needed for scalastyle to pass since equals is present.
srowen: That's OK, leave in Product if it's actually testing the case you have in mind. Yes, I know equals() is needed. The new implementation looks good.
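For reference, a minimal sketch of a fixture along the lines settled on above (modeled on the ScroogeLikeExample discussed in this thread; the PR's actual fixture may differ in detail):

```scala
// A trait constructible only through its companion apply, like scrooge codegen output.
trait ScroogeLikeExample extends Product1[Int] with Serializable {
  def x: Int
  override def _1: Int = x

  // canEqual is required to implement Product; equals avoids reference
  // equality so round-trip tests can compare decoded values; hashCode
  // accompanies equals to satisfy scalastyle.
  override def canEqual(other: Any): Boolean = other.isInstanceOf[ScroogeLikeExample]
  override def equals(other: Any): Boolean =
    canEqual(other) && this.x == other.asInstanceOf[ScroogeLikeExample].x
  override def hashCode: Int = x
}

object ScroogeLikeExample {
  // The companion "constructor" that this PR teaches ScalaReflection to find.
  def apply(x: Int): ScroogeLikeExample = new Immutable(x)

  class Immutable(val x: Int) extends ScroogeLikeExample
}
```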
Test build #99039 has finished for PR 23062 at commit
Test build #99029 has finished for PR 23062 at commit
Test build #99031 has finished for PR 23062 at commit
Test build #99043 has finished for PR 23062 at commit
Test build #99044 has finished for PR 23062 at commit
The change to `findConstructor` under discussion (excerpt; the match replaces what was previously just the `Option(...)` expression):

```scala
def findConstructor[T](cls: Class[T], paramTypes: Seq[Class[_]]): Option[Seq[AnyRef] => T] = {
  Option(ConstructorUtils.getMatchingAccessibleConstructor(cls, paramTypes: _*)) match {
    case Some(c) => Some(x => c.newInstance(x: _*).asInstanceOf[T])
```
srowen: I think this cast isn't needed because Class[T].newInstance returns T already, but it's fine to leave.
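For illustration, a self-contained sketch of the companion-apply lookup that the non-codegen path performs via scala reflection (the helper name and the assumption of a single non-overloaded `apply` are mine, not the PR's):

```scala
import scala.reflect.runtime.{universe => ru}

// Find a companion object's apply method reflectively and invoke it, roughly
// what NewInstance must do when the class has no real constructor.
// Assumes the companion defines exactly one, non-overloaded apply.
def companionApply[T: ru.TypeTag](args: AnyRef*): T = {
  val mirror = ru.runtimeMirror(getClass.getClassLoader)
  val companion = ru.typeOf[T].typeSymbol.companion.asModule
  val instance = mirror.reflectModule(companion).instance
  val applySym = companion.info.member(ru.TermName("apply")).asMethod
  mirror.reflect(instance).reflectMethod(applySym)(args: _*).asInstanceOf[T]
}

// e.g. companionApply[ScroogeLikeExample](Integer.valueOf(1)) == ScroogeLikeExample(1)
```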
Merged to master
## What changes were proposed in this pull request?

This change fixes a particular scenario where default Spark SQL can't encode (thrift) types that are generated by twitter scrooge. These types are a trait that extends `scala.ProductX` with a constructor defined only in a companion object, rather than an actual case class. The actual case class used is a child class, but that type is almost never referred to in code. The type has no corresponding constructor symbol and causes an exception. For all other purposes, these classes act just like case classes, so it is unfortunate that Spark SQL can't serialize them as nicely as it can actual case classes. For a full example of a scrooge codegen class, see https://gist.github.com/anonymous/ba13d4b612396ca72725eaa989900314.

This change catches the case where the type has no constructor but does have an `apply` method on the type's companion object. This allows thrift types to be serialized/deserialized with implicit encoders the same way as normal case classes. This fix had to be done in three places where the constructor is assumed to be an actual constructor:

1) In serializing, determining the schema for the dataframe relies on inspecting its constructor (`ScalaReflection.constructParams`). Here we fall back to using the companion constructor arguments.

2) In deserializing or evaluating, in the java codegen (`NewInstance.doGenCode`), the type couldn't be constructed with the `new` keyword. If there is no constructor, we change the constructor call to try the companion constructor.

3) In deserializing or evaluating, without codegen, the constructor is directly invoked (`NewInstance.constructor`). This was fixed with scala reflection to get the actual companion apply method.

The return type of `findConstructor` was changed because the companion apply method constructor can't be represented as a `java.lang.reflect.Constructor`.

There might be situations in which this approach would also fail in a new way, but it does at a minimum work for the specific scrooge example and will not impact cases that were already succeeding prior to this change.

Note: this fix does not enable using scrooge thrift enums; additional work is necessary for that. With this patch, it seems like you could patch `com.twitter.scrooge.ThriftEnum` to extend `_root_.scala.Product1[Int]` with `def _1 = value` to get Spark's implicit encoders to handle enums, but I've yet to use this method myself.

Note: I previously opened a PR for this issue, but was only able to fix case 1) there: apache#18766

## How was this patch tested?

I've fixed all 3 cases and added two tests that use a case class similar to a scrooge-generated one. The test in ScalaReflectionSuite checks 1), and the additional assertions in ObjectExpressionsSuite check 2) and 3).

Closes apache#23062 from drewrobb/SPARK-8288.

Authored-by: Drew Robb <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
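As a usage illustration of what this PR enables, a minimal sketch (assuming the ScroogeLikeExample fixture sketched earlier is on the classpath; the local-mode setup is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.ScroogeLikeExample

object CompanionConstructorDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
    import spark.implicits._

    // ScroogeLikeExample has no public constructor, only a companion apply;
    // with this change the implicit product encoder round-trips it anyway.
    val ds = Seq(ScroogeLikeExample(1), ScroogeLikeExample(2)).toDS()
    assert(ds.collect().toSeq == Seq(ScroogeLikeExample(1), ScroogeLikeExample(2)))

    spark.stop()
  }
}
```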
### What changes were proposed in this pull request?

Fixes encoding of classes that use companion object constructors in the interpreted path. Without this change, the spec that is added in this change would fail with:

```
...
Cause: java.lang.RuntimeException: Error while decoding: java.lang.RuntimeException: Couldn't find a valid constructor on interface org.apache.spark.sql.catalyst.ScroogeLikeExample
newInstance(interface org.apache.spark.sql.catalyst.ScroogeLikeExample)
at org.apache.spark.sql.errors.QueryExecutionErrors$.expressionDecodingError(QueryExecutionErrors.scala:1199)
...
```

As far as I can tell, this bug has existed since the initial implementation in SPARK-8288 (#23062). The existing spec that tested this part of the code incorrectly provided an outerPointer, which hid the bug from that test.

### Why are the changes needed?

Fixes a bug; the new spec in ExpressionEncoderSuite shows that this is in fact a bug.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a bug.

### How was this patch tested?

New and existing specs in ExpressionEncoderSuite and ObjectExpressionsSuite.

Closes #37837 from eejbyfeldt/spark-40385.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
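A hedged sketch of the round-trip such a spec exercises (`ExpressionEncoder`, `resolveAndBind`, `createSerializer`, and `createDeserializer` are Spark 3.x APIs; the suite's actual spec may differ):

```scala
import org.apache.spark.sql.catalyst.ScroogeLikeExample
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Serialize and deserialize through an ExpressionEncoder. Before the fix,
// the interpreted deserialization path (no codegen) could not locate the
// companion constructor and failed with the decoding error quoted above.
val enc = ExpressionEncoder[ScroogeLikeExample]().resolveAndBind()
val internalRow = enc.createSerializer()(ScroogeLikeExample(1))
assert(enc.createDeserializer()(internalRow) == ScroogeLikeExample(1))
```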